It is common in Structural Equation Modeling (SEM) to deal with longitudinal data via a Latent Growth Curve (LGC) model. It turns out that LGC are in a sense, just a different form of the very commonly used mixed model framework. In some ways they are more flexible, mostly in the standard structural equation modeling framework that allows for indirect, and other complex covariate relationships. In other ways, they are less flexible, e.g. with missing data, estimating nonlinear relationships, incorporating with many time points, dealing with time-varying covariates. With appropriate tools there is little one can’t do with the normal mixed model approach relative to the SEM approach, and one would likely have easier interpretation. As such I’d recommend sticking with the standard mixed model framework unless you really need to, but it is useful to have both tools.
To best understand a growth curve model, I still think it’s instructive to see it from the mixed model perspective, where things are mostly interpretable from what you know from a standard linear model. We will use our GPA example from before, and one can refer to the [appendix][Appendix] for more detail.
As before we assume the following for the GPA model. As a simple starting point we merely model a trend of time (occasion- 6 semesters) and have random effects due to student for both intercept and occasion. In this setting we are treating time as numeric, but one could treat the occasion variable as categorical.
\[\mathcal{GPA} = (b_{\mathrm{intercept}} + \mathrm{re}_{\mathrm{intercept}}) + (b_{\mathrm{occ}} + \mathrm{re}_{\mathrm{occasion}})\cdot \mathrm{occasion} + \epsilon\]
\[\mathrm{re}_{\mathrm{intercept}} \sim \mathcal{N}(0, \tau)\] \[\mathrm{re}_{\mathrm{occasion}} \sim \mathcal{N}(0, \varphi)\] \[\epsilon \sim \mathcal{N}(0, \sigma)\]
Thus the student effects for the intercept and slope are random, and specifically are normally distributed with mean of zero and some estimated standard deviation (\(\tau\), \(\varphi\) respectively)1. We consider these effects as coming from unspecified, or latent, causes due to student. In addition, we have the usual residual error \(\epsilon\).
The corresponding model may be run using lme4 as follows.
load('data/gpa.RData')
# if you haven't downloaded the workshop RStudio project
# load(url('https://github.com/m-clark/mixed-models-with-R/raw/master/data/gpa.RData?raw=true'))
library(lme4)
mixed_init = lmer(gpa ~ occasion + (1 + occasion|student), data = gpa)
# summary(mixed_init)
With regard to the output, the fixed (population-average) effects are the \(b_{\mathrm{intercept}}\) and \(b_{\mathrm{occ}}\) in the previous model depiction. The standard deviations of the random effects are the \(\tau\), \(\varphi\) and \(\epsilon\).
| effect | group | term | estimate | std.error | statistic |
|---|---|---|---|---|---|
| fixed | (Intercept) | 2.60 | 0.02 | 141.59 | |
| fixed | occasion | 0.11 | 0.01 | 18.07 | |
| ran_pars | student | sd__(Intercept) | 0.21 | ||
| ran_pars | student | sd__occasion | 0.07 | ||
| ran_pars | student | cor__(Intercept).occasion | -0.10 | ||
| ran_pars | Residual | sd__Observation | 0.21 |
In SEM, we specify the latent linear model as follows.
\[Y = b_{\mathrm{intercept}} + \lambda F + \epsilon\] \[F \sim \mathcal{N}(0, \tau)\]
In the above, \(Y\) is our observed variable, \(b_{intercept}\) is the intercept as in a standard linear regression model, \(\lambda\) is the coefficient (loading in factor analysis/SEM terminology) for the latent variable, represented as \(F\). The latent variable is assumed normally distributed, with zero mean, and some estimated variance, just like the random effects in mixed models.
Note that if \(\lambda = 1\), we then have the right hand side as \(b_{intercept} + F\), and this is indistinguishable from the random intercept portion of the mixed model (\(b_{\mathrm{intercept}} + \mathrm{re}_{\mathrm{intercept}}\)). Through this that we can maybe start to get a sense of random effects as latent variables (or vice versa). Indeed, mixed models have ties to many other kinds of models (e.g. spatial, additive), because those models also add a ‘random’ component to the model in some fashion.
The graphical model for the standard LGC model resembles that of confirmatory factor analysis (CFA) with two latent variables/factors. The observed, or manifest, measures are the dependent variable values at each respective time point. However, for those familiar with structural equation modeling (SEM), growth curve models will actually look a bit different compared with typical SEM, because we have to fix the factor loadings to specific values in order to make it work for the LGC. As we will see, this also leads to non-standard output relative to other SEM models, as there is nothing to estimate for the many fixed parameters.
More specifically, we’ll have a latent variable representing the random intercepts, as well as one representing the random slopes for the longitudinal trend (time), which in the GPA data is the semester indicator. All loadings for the intercept factor are 1. The loadings for the effect of time are arbitrary, but should accurately reflect the time spacing, and typically it is good to start at zero, so that the zero has a meaningful interpretation.
As might be guessed from the above visualization, for the LGC our data needs to be in wide format, where each row represents a person and we have separate columns for each time point of the target variable, as well as any other variable that varies over time. This is contrasted with the long format we use for the mixed model, where rows represent observations at a given time point. We can use the spread function from tidyr to help with that. We end up with a data frame of two-hundred observations and columns for each semester gpa (0 through 5 for six semesters) denoted by gpa_*.
gpa_wide = gpa %>%
select(student, sex, highgpa, occasion, gpa) %>%
spread(key = occasion, value = gpa) %>%
rename_at(vars(`0`,`1`,`2`,`3`,`4`,`5`), function(x) glue::glue('gpa_{x}'))
We’ll use lavaan for our excursion into LGC. The syntax will require its own modeling code, but lavaan tries to keep to R regression model style. The names of intercept and slope are arbitrary. The =~ is just denoting that the left-hand side is the latent variable, and the right-hand side are the observed/manifest variables.
lgc_init_model = '
intercept =~ 1*gpa_0 + 1*gpa_1 + 1*gpa_2 + 1*gpa_3 + 1*gpa_4 + 1*gpa_5
occasion =~ 0*gpa_0 + 1*gpa_1 + 2*gpa_2 + 3*gpa_3 + 4*gpa_4 + 5*gpa_5
'
Now we’re ready to run the model. Note that lavaan has a specific function, growth, to use for these models. It doesn’t spare us any effort for the model syntax, but does make it unnecessary to set various arguments for the more generic sem and lavaan functions.
library(lavaan)
lgc_init = growth(lgc_init_model, data = gpa_wide)
summary(lgc_init)
lavaan 0.6-3 ended normally after 73 iterations
Optimization method NLMINB
Number of free parameters 11
Number of observations 200
Estimator ML
Model Fit Test Statistic 43.945
Degrees of freedom 16
P-value (Chi-square) 0.000
Parameter Estimates:
Information Expected
Information saturated (h1) model Structured
Standard Errors Standard
Latent Variables:
Estimate Std.Err z-value P(>|z|)
intercept =~
gpa_0 1.000
gpa_1 1.000
gpa_2 1.000
gpa_3 1.000
gpa_4 1.000
gpa_5 1.000
occasion =~
gpa_0 0.000
gpa_1 1.000
gpa_2 2.000
gpa_3 3.000
gpa_4 4.000
gpa_5 5.000
Covariances:
Estimate Std.Err z-value P(>|z|)
intercept ~~
occasion 0.002 0.002 1.629 0.103
Intercepts:
Estimate Std.Err z-value P(>|z|)
.gpa_0 0.000
.gpa_1 0.000
.gpa_2 0.000
.gpa_3 0.000
.gpa_4 0.000
.gpa_5 0.000
intercept 2.598 0.018 141.956 0.000
occasion 0.106 0.005 20.338 0.000
Variances:
Estimate Std.Err z-value P(>|z|)
.gpa_0 0.080 0.010 8.136 0.000
.gpa_1 0.071 0.008 8.799 0.000
.gpa_2 0.054 0.006 9.039 0.000
.gpa_3 0.029 0.003 8.523 0.000
.gpa_4 0.015 0.002 5.986 0.000
.gpa_5 0.016 0.003 4.617 0.000
intercept 0.035 0.007 4.947 0.000
occasion 0.003 0.001 5.645 0.000
Most of the output is blank, which is needless clutter, but we do get the same five parameter values we are interested in though.
Start with the ‘intercepts’:
Intercepts:
Estimate Std.Err z-value P(>|z|)
intercept 2.598 0.018 141.956 0.000
occasion 0.106 0.005 20.338 0.000
It might be odd to call your fixed effects ‘intercepts’, but it makes sense if we are thinking of it as a multilevel model as depicted previously, where we actually broke out the random effects as a separate model. These are the population average of the random intercepts and slopes for occasion. The estimates here are pretty much spot on with our mixed model estimates.
library(lme4)
gpa_mixed = lmer(gpa ~ occasion + (1 + occasion | student), data=gpa)
summary(gpa_mixed, cor=F)
Linear mixed model fit by REML ['lmerMod']
Formula: gpa ~ occasion + (1 + occasion | student)
Data: gpa
REML criterion at convergence: 261
Scaled residuals:
Min 1Q Median 3Q Max
-3.2695 -0.5377 -0.0128 0.5326 3.1939
Random effects:
Groups Name Variance Std.Dev. Corr
student (Intercept) 0.045193 0.21259
occasion 0.004504 0.06711 -0.10
Residual 0.042388 0.20588
Number of obs: 1200, groups: student, 200
Fixed effects:
Estimate Std. Error t value
(Intercept) 2.599214 0.018357 141.59
occasion 0.106314 0.005885 18.07
Now let’s look at the variance estimates. The estimation of residual variance for each time point in the LGC distinguishes the two approaches, but not necessarily so. We could fix them to be identical here, or conversely allow them to be estimated in the mixed model framework. Just know that’s why the results are not identical (to go along with their respective estimation approaches, which are also different by default).
Covariances:
Estimate Std.Err z-value P(>|z|)
intercept ~~
occasion 0.002 0.002 1.629 0.103
Variances:
Estimate Std.Err z-value P(>|z|)
.gpa_0 0.080 0.010 8.136 0.000
.gpa_1 0.071 0.008 8.799 0.000
.gpa_2 0.054 0.006 9.039 0.000
.gpa_3 0.029 0.003 8.523 0.000
.gpa_4 0.015 0.002 5.986 0.000
.gpa_5 0.016 0.003 4.617 0.000
intercept 0.035 0.007 4.947 0.000
occasion 0.003 0.001 5.645 0.000
print(VarCorr(gpa_mixed), comp='Var') # use print to show variance only
Groups Name Variance Corr
student (Intercept) 0.0451934
occasion 0.0045039 -0.098
Residual 0.0423879
The differences provide some insight. LGC by default assumes heterogeneous variance for each time point. Mixed models by default assume the same variance for each time point, but can allow them to be estimated separately in most modeling packages.
How can we put these models on the same footing? Let’s take a step back and do a model with only random intercepts. In this case, time is an observed measure, and has no person-specific variability. Our graphical model now looks like the following.
We can do this by fixing the slope ‘factor’ to have zero variance. However, note also that in the LGC, at each time point of the gpa outcome, we have a unique (residual) variance associated with it. Conversely, this is constant in the mixed model setting, i.e. we only have one estimate for the residual variance that does not vary by occasion. We deal with this in the LGC by giving the parameter a name and then applying it to each time point.
lgc_ran_int_model = '
intercept =~ 1*gpa_0 + 1*gpa_1 + 1*gpa_2 + 1*gpa_3 + 1*gpa_4 + 1*gpa_5
slope =~ 0*gpa_0 + 1*gpa_1 + 2*gpa_2 + 3*gpa_3 + 4*gpa_4 + 5*gpa_5
slope ~~ 0*slope # slope variance is zero
intercept ~~ 0*slope # no covariance with intercept factor
gpa_0 ~~ resid*gpa_0 # same residual variance for each time point
gpa_1 ~~ resid*gpa_1
gpa_2 ~~ resid*gpa_2
gpa_3 ~~ resid*gpa_3
gpa_4 ~~ resid*gpa_4
gpa_5 ~~ resid*gpa_5
'
Now each time point will have one variance estimate. Let’s run the LGC.
lgc_ran_int = growth(lgc_ran_int_model, data = gpa_wide)
summary(lgc_ran_int, nd=4) # increase the number of digits shown
lavaan 0.6-3 ended normally after 36 iterations
Optimization method NLMINB
Number of free parameters 9
Number of equality constraints 5
Number of observations 200
Estimator ML
Model Fit Test Statistic 338.824
Degrees of freedom 23
P-value (Chi-square) 0.000
Parameter Estimates:
Information Expected
Information saturated (h1) model Structured
Standard Errors Standard
Latent Variables:
Estimate Std.Err z-value P(>|z|)
intercept =~
gpa_0 1.0000
gpa_1 1.0000
gpa_2 1.0000
gpa_3 1.0000
gpa_4 1.0000
gpa_5 1.0000
slope =~
gpa_0 0.0000
gpa_1 1.0000
gpa_2 2.0000
gpa_3 3.0000
gpa_4 4.0000
gpa_5 5.0000
Covariances:
Estimate Std.Err z-value P(>|z|)
intercept ~~
slope 0.0000
Intercepts:
Estimate Std.Err z-value P(>|z|)
.gpa_0 0.0000
.gpa_1 0.0000
.gpa_2 0.0000
.gpa_3 0.0000
.gpa_4 0.0000
.gpa_5 0.0000
intercept 2.5992 0.0217 120.0471 0.0000
slope 0.1063 0.0041 26.1094 0.0000
Variances:
Estimate Std.Err z-value P(>|z|)
slope 0.0000
.gpa_0 (resd) 0.0580 0.0026 22.3607 0.0000
.gpa_1 (resd) 0.0580 0.0026 22.3607 0.0000
.gpa_2 (resd) 0.0580 0.0026 22.3607 0.0000
.gpa_3 (resd) 0.0580 0.0026 22.3607 0.0000
.gpa_4 (resd) 0.0580 0.0026 22.3607 0.0000
.gpa_5 (resd) 0.0580 0.0026 22.3607 0.0000
intrcpt 0.0634 0.0073 8.6605 0.0000
Compare it to the corresponding mixed model.
mixed_ran_int = lmer(gpa ~ occasion + (1|student), data=gpa)
summary(mixed_ran_int, cor=F)
Linear mixed model fit by REML ['lmerMod']
Formula: gpa ~ occasion + (1 | student)
Data: gpa
REML criterion at convergence: 408.9
Scaled residuals:
Min 1Q Median 3Q Max
-3.6169 -0.6373 -0.0004 0.6361 2.8310
Random effects:
Groups Name Variance Std.Dev.
student (Intercept) 0.06372 0.2524
Residual 0.05809 0.2410
Number of obs: 1200, groups: student, 200
Fixed effects:
Estimate Std. Error t value
(Intercept) 2.599214 0.021696 119.8
occasion 0.106314 0.004074 26.1
# broom.mixed::tidy(comp='var') %>%
# kable_df()
Now we have essentially identical results to mixed_ran_int. The default estimation process is different for the two, resulting in some differences starting several decimal places out, but these are not meaningful differences. We can actually use the same estimator, but the results will still differ slightly due to the data differences.
Now let’s let the slope for occasion vary. We can just delete or comment out the syntax related to the (co-) variance. By default slopes and intercepts are allowed to correlate as in the mixed model. We will continue to keep the variance constant.
lgc_ran_int_ran_slope_model = '
intercept =~ 1*gpa_0 + 1*gpa_1 + 1*gpa_2 + 1*gpa_3 + 1*gpa_4 + 1*gpa_5
slope =~ 0*gpa_0 + 1*gpa_1 + 2*gpa_2 + 3*gpa_3 + 4*gpa_4 + 5*gpa_5
# slope ~~ 0*slope # slope variance is zero
# intercept ~~ 0*slope # no covariance
gpa_0 ~~ resid*gpa_0 # same residual variance for each time point
gpa_1 ~~ resid*gpa_1
gpa_2 ~~ resid*gpa_2
gpa_3 ~~ resid*gpa_3
gpa_4 ~~ resid*gpa_4
gpa_5 ~~ resid*gpa_5
'
lgc_ran_int_ran_slope = growth(lgc_ran_int_ran_slope_model, data = gpa_wide)
summary(lgc_ran_int_ran_slope, nd=4) # increase the number of digits shown
lavaan 0.6-3 ended normally after 51 iterations
Optimization method NLMINB
Number of free parameters 11
Number of equality constraints 5
Number of observations 200
Estimator ML
Model Fit Test Statistic 191.409
Degrees of freedom 21
P-value (Chi-square) 0.000
Parameter Estimates:
Information Expected
Information saturated (h1) model Structured
Standard Errors Standard
Latent Variables:
Estimate Std.Err z-value P(>|z|)
intercept =~
gpa_0 1.0000
gpa_1 1.0000
gpa_2 1.0000
gpa_3 1.0000
gpa_4 1.0000
gpa_5 1.0000
slope =~
gpa_0 0.0000
gpa_1 1.0000
gpa_2 2.0000
gpa_3 3.0000
gpa_4 4.0000
gpa_5 5.0000
Covariances:
Estimate Std.Err z-value P(>|z|)
intercept ~~
slope -0.0014 0.0016 -0.8337 0.4045
Intercepts:
Estimate Std.Err z-value P(>|z|)
.gpa_0 0.0000
.gpa_1 0.0000
.gpa_2 0.0000
.gpa_3 0.0000
.gpa_4 0.0000
.gpa_5 0.0000
intercept 2.5992 0.0183 141.9471 0.0000
slope 0.1063 0.0059 18.1113 0.0000
Variances:
Estimate Std.Err z-value P(>|z|)
.gpa_0 (resd) 0.0424 0.0021 20.0000 0.0000
.gpa_1 (resd) 0.0424 0.0021 20.0000 0.0000
.gpa_2 (resd) 0.0424 0.0021 20.0000 0.0000
.gpa_3 (resd) 0.0424 0.0021 20.0000 0.0000
.gpa_4 (resd) 0.0424 0.0021 20.0000 0.0000
.gpa_5 (resd) 0.0424 0.0021 20.0000 0.0000
intrcpt 0.0449 0.0068 6.5992 0.0000
slope 0.0045 0.0007 6.3874 0.0000
Again, we compare the mixed model to show idential output.
mixed_ran_int_ran_slope = lmer(gpa ~ occasion + (1 + occasion|student), data=gpa)
summary(mixed_ran_int_ran_slope)
Linear mixed model fit by REML ['lmerMod']
Formula: gpa ~ occasion + (1 + occasion | student)
Data: gpa
REML criterion at convergence: 261
Scaled residuals:
Min 1Q Median 3Q Max
-3.2695 -0.5377 -0.0128 0.5326 3.1939
Random effects:
Groups Name Variance Std.Dev. Corr
student (Intercept) 0.045193 0.21259
occasion 0.004504 0.06711 -0.10
Residual 0.042388 0.20588
Number of obs: 1200, groups: student, 200
Fixed effects:
Estimate Std. Error t value
(Intercept) 2.599214 0.018357 141.59
occasion 0.106314 0.005885 18.07
Correlation of Fixed Effects:
(Intr)
occasion -0.345
In addition, the estimated random coefficients estimates from the mixed model perfectly correlate with those of the latent variables.
| Int_mixed | Slope_mixed | Int_LGC | Slope_LGC |
|---|---|---|---|
| 2.40 | 0.17 | 2.40 | 0.17 |
| 2.39 | 0.10 | 2.39 | 0.10 |
| 2.59 | 0.15 | 2.59 | 0.15 |
| 2.51 | 0.06 | 2.51 | 0.06 |
| 2.69 | 0.08 | 2.69 | 0.08 |
| 2.39 | 0.06 | 2.39 | 0.06 |
Note that the intercept-slope relationship in the LGC is expressed as a covariance. If we want correlation, we just ask for standardized output.
summary(lgc_ran_int_ran_slope, nd=4, std=T)
lavaan 0.6-3 ended normally after 51 iterations
Optimization method NLMINB
Number of free parameters 11
Number of equality constraints 5
Number of observations 200
Estimator ML
Model Fit Test Statistic 191.409
Degrees of freedom 21
P-value (Chi-square) 0.000
Parameter Estimates:
Information Expected
Information saturated (h1) model Structured
Standard Errors Standard
Latent Variables:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
intercept =~
gpa_0 1.0000 0.2118 0.7170
gpa_1 1.0000 0.2118 0.7100
gpa_2 1.0000 0.2118 0.6709
gpa_3 1.0000 0.2118 0.6132
gpa_4 1.0000 0.2118 0.5508
gpa_5 1.0000 0.2118 0.4920
slope =~
gpa_0 0.0000 0.0000 0.0000
gpa_1 1.0000 0.0669 0.2241
gpa_2 2.0000 0.1337 0.4235
gpa_3 3.0000 0.2006 0.5807
gpa_4 4.0000 0.2674 0.6955
gpa_5 5.0000 0.3343 0.7764
Covariances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
intercept ~~
slope -0.0014 0.0016 -0.8337 0.4045 -0.0963 -0.0963
Intercepts:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
.gpa_0 0.0000 0.0000 0.0000
.gpa_1 0.0000 0.0000 0.0000
.gpa_2 0.0000 0.0000 0.0000
.gpa_3 0.0000 0.0000 0.0000
.gpa_4 0.0000 0.0000 0.0000
.gpa_5 0.0000 0.0000 0.0000
intercept 2.5992 0.0183 141.9471 0.0000 12.2724 12.2724
slope 0.1063 0.0059 18.1113 0.0000 1.5903 1.5903
Variances:
Estimate Std.Err z-value P(>|z|) Std.lv Std.all
.gpa_0 (resd) 0.0424 0.0021 20.0000 0.0000 0.0424 0.4859
.gpa_1 (resd) 0.0424 0.0021 20.0000 0.0000 0.0424 0.4763
.gpa_2 (resd) 0.0424 0.0021 20.0000 0.0000 0.0424 0.4253
.gpa_3 (resd) 0.0424 0.0021 20.0000 0.0000 0.0424 0.3554
.gpa_4 (resd) 0.0424 0.0021 20.0000 0.0000 0.0424 0.2867
.gpa_5 (resd) 0.0424 0.0021 20.0000 0.0000 0.0424 0.2287
intrcpt 0.0449 0.0068 6.5992 0.0000 1.0000 1.0000
slope 0.0045 0.0007 6.3874 0.0000 1.0000 1.0000
The std.all is what we typically will look at.
We have demonstrated heterogenous variances [previously][Heterogeneous Variance]. But to revisit here, lme4 does not provide an easy way to have separate variance at each time point, sacrificing various model complexities for computational advantages. However, nlme provides an easy, though not straightforward way to get at these estimates. See the previous section for details.
library(nlme)
mixed_ran_int_ran_slope_hetero_var = lme(gpa ~ occasion,
random = ~ 1 + occasion | student,
data = gpa,
weights = varIdent(form = ~1|occasion))
summary(mixed_ran_int_ran_slope_hetero_var)
Linear mixed-effects model fit by REML
Data: gpa
AIC BIC logLik
135.6756 191.6481 -56.83781
Random effects:
Formula: ~1 + occasion | student
Structure: General positive-definite, Log-Cholesky parametrization
StdDev Corr
(Intercept) 0.18748341 (Intr)
occasion 0.05743311 0.228
Residual 0.28217560
Variance function:
Structure: Different standard deviations per stratum
Formula: ~1 | occasion
Parameter estimates:
0 1 2 3 4 5
1.0000000 0.9417088 0.8241164 0.6026302 0.4295708 0.4451890
Fixed effects: gpa ~ occasion
Value Std.Error DF t-value p-value
(Intercept) 2.5979778 0.018347305 999 141.59997 0
occasion 0.1064788 0.005249017 999 20.28548 0
Correlation:
(Intr)
occasion -0.277
Standardized Within-Group Residuals:
Min Q1 Med Q3 Max
-2.949990210 -0.594480704 -0.006692484 0.585054676 3.016566638
Number of Observations: 1200
Number of Groups: 200
| 0.080 |
| 0.071 |
| 0.054 |
| 0.029 |
| 0.015 |
| 0.016 |
Compare to the LGC (our lgc_init model).
Variances:
Estimate Std.Err z-value P(>|z|)
.gpa_0 0.080 0.010 8.136 0.000
.gpa_1 0.071 0.008 8.799 0.000
.gpa_2 0.054 0.006 9.039 0.000
.gpa_3 0.029 0.003 8.523 0.000
.gpa_4 0.015 0.002 5.986 0.000
.gpa_5 0.016 0.003 4.617 0.000
Within these models we can have cluster level covariates or those that vary over time. We will examine each in turn.
To add a cluster-level covariate, for a mixed model, it looks something like this (ignoring lowest level subscript, \(b_0\) = intercept):
standard random intercept
\[y = b_{0c} + b_{occ}*\mathrm{time} + e \]
\[b_{0c} = b_0 + u_{intercept}\]
Plugging in becomes:
\[y = b_0 + b_{occ}*\mathrm{occasion} + u_{intercept} + e \]
subject level covariate added
\[b_{0c} = b_0 + b_{sex}*\mathrm{sex} + u_{intercept}\]
But if we plug that into our level 1 model, it just becomes:
\[y = b_0 + b_{occ}*\mathrm{occasion} + b_{sex}*\mathrm{sex} + u_{intercept} + e\]
In our previous modeling syntax it would look like this:
gpa_mixed = lmer(gpa ~ sex + occasion + (1|student), data = gpa)
We’d have a fixed effect for sex and interpret it just like in the standard setting.
With LGC, there is a tendency to interpret the model as an SEM, with the language of effects on latent variables, and certainly one can. But adding additional covariates typically causes confusion for those not familiar with mixed models. We literally do have to regress the intercept and slope latent variables on cluster level covariates as follows.
Furthermore, people automatically put in an effect for the cluster level covariate on the slope as well as the intercept factor. In the mixed model this would result in the following:
subject level covariate added added for slopes
\[y = b_{0c} + b_{occ_c}*\mathrm{time} + e \]
\[b_{0c} = b_0 + b_{sex}*\mathrm{sex} + u_{intercept}\]
\[b_{occ_c} = b_{occ} + \gamma*\mathrm{sex} + u_{slope}\]
And after plugging in:
\[y = \color{#b2001d}{b_0 + b_{sex}*\mathrm{sex} + b_{occ}*\mathrm{occasion} + \mathbf{\gamma*\mathrm{sex}*\mathrm{occasion}}} + \color{#001eb2}{u_{intercept} + u_{slope}*\mathrm{occasion}} + e\]
The fixed effects are in red, while the random effects are in blue. Focussing on the fixed effects, we can see that this warrants an interaction between sex and occasion. This is not required, but one should add it if they actually are interested in the interaction. Our graphical model looks like the following using the above notation.
We are now ready to run the LGC for comparison.
lgc_cluster_level_model <- '
intercept =~ 1*gpa_0 + 1*gpa_1 + 1*gpa_2 + 1*gpa_3 + 1*gpa_4 + 1*gpa_5
occasion =~ 0*gpa_0 + 1*gpa_1 + 2*gpa_2 + 3*gpa_3 + 4*gpa_4 + 5*gpa_5
# regressions
intercept ~ sex
occasion ~ sex
gpa_0 ~~ resid*gpa_0 # same residual variance for each time point
gpa_1 ~~ resid*gpa_1
gpa_2 ~~ resid*gpa_2
gpa_3 ~~ resid*gpa_3
gpa_4 ~~ resid*gpa_4
gpa_5 ~~ resid*gpa_5
'
lgc_cluster_level = growth(lgc_cluster_level_model, data = gpa_wide, estimator='ML')
summary(lgc_cluster_level)
lavaan 0.6-3 ended normally after 47 iterations
Optimization method NLMINB
Number of free parameters 13
Number of equality constraints 5
Number of observations 200
Estimator ML
Model Fit Test Statistic 193.415
Degrees of freedom 25
P-value (Chi-square) 0.000
Parameter Estimates:
Information Expected
Information saturated (h1) model Structured
Standard Errors Standard
Latent Variables:
Estimate Std.Err z-value P(>|z|)
intercept =~
gpa_0 1.000
gpa_1 1.000
gpa_2 1.000
gpa_3 1.000
gpa_4 1.000
gpa_5 1.000
occasion =~
gpa_0 0.000
gpa_1 1.000
gpa_2 2.000
gpa_3 3.000
gpa_4 4.000
gpa_5 5.000
Regressions:
Estimate Std.Err z-value P(>|z|)
intercept ~
sex 0.076 0.036 2.083 0.037
occasion ~
sex 0.029 0.012 2.499 0.012
Covariances:
Estimate Std.Err z-value P(>|z|)
.intercept ~~
.occasion -0.002 0.002 -1.184 0.237
Intercepts:
Estimate Std.Err z-value P(>|z|)
.gpa_0 0.000
.gpa_1 0.000
.gpa_2 0.000
.gpa_3 0.000
.gpa_4 0.000
.gpa_5 0.000
.intercept 2.484 0.058 42.671 0.000
.occasion 0.062 0.019 3.349 0.001
Variances:
Estimate Std.Err z-value P(>|z|)
.gpa_0 (resd) 0.042 0.002 20.000 0.000
.gpa_1 (resd) 0.042 0.002 20.000 0.000
.gpa_2 (resd) 0.042 0.002 20.000 0.000
.gpa_3 (resd) 0.042 0.002 20.000 0.000
.gpa_4 (resd) 0.042 0.002 20.000 0.000
.gpa_5 (resd) 0.042 0.002 20.000 0.000
.intrcpt 0.043 0.007 6.525 0.000
.occasin 0.004 0.001 6.273 0.000
Applied researchers commonly have difficulty interpreting the model due to past experience with SEM. While these are latent variables, they aren’t just latent variables or underlying constructs. It doesn’t help that the output can be confusing, because now one has an ‘intercept for your intercepts’ and an ‘intercept for your slopes’. In the multilevel context it makes sense, but there you know ‘intercept’ is just ‘fixed effect’.
This is the corresponding mixed model:
mixed_cluster_level_cov = lmer(gpa ~ sex + occasion + sex:occasion + (1 + occasion|student), data = gpa)
summary(mixed_cluster_level_cov, cor=F)
Linear mixed model fit by REML ['lmerMod']
Formula: gpa ~ sex + occasion + sex:occasion + (1 + occasion | student)
Data: gpa
REML criterion at convergence: 256.7
Scaled residuals:
Min 1Q Median 3Q Max
-3.2556 -0.5409 -0.0142 0.5407 3.2263
Random effects:
Groups Name Variance Std.Dev. Corr
student (Intercept) 0.044096 0.20999
occasion 0.004328 0.06579 -0.14
Residual 0.042388 0.20588
Number of obs: 1200, groups: student, 200
Fixed effects:
Estimate Std. Error t value
(Intercept) 2.559549 0.026418 96.888
sexfemale 0.075553 0.036460 2.072
occasion 0.091128 0.008429 10.811
sexfemale:occasion 0.028927 0.011634 2.486
Similarly, if we had a time-varying covariate, average weekly hours spent in the library, it’d look like the following. The gpa data doesn’t really come with a useful time-varying covariate, so I’ve added one for this demo.
summary(gpa_mixed_tvc, cor=F)
Linear mixed model fit by REML ['lmerMod']
Formula: gpa ~ occasion + lib_hours + (1 + occasion | student)
Data: gpa
REML criterion at convergence: 48.5
Scaled residuals:
Min 1Q Median 3Q Max
-3.4105 -0.5185 -0.0023 0.5202 2.9575
Random effects:
Groups Name Variance Std.Dev. Corr
student (Intercept) 0.033135 0.18203
occasion 0.002817 0.05307 -0.13
Residual 0.037591 0.19388
Number of obs: 1200, groups: student, 200
Fixed effects:
Estimate Std. Error t value
(Intercept) 2.385494 0.021079 113.17
occasion 0.082838 0.005196 15.94
lib_hours 0.032216 0.002024 15.92
Though we could have a random slope for hours if we wanted. You get the picture. Most of the model is still standard regression interpretation.
With time varying covariates, i.e. those that can have a different value at each time point, the syntax starts to get tedious. Here we add just one such covariate, \(c\).
lgc_tvc_model <- '
intercept =~ 1*gpa_0 + 1*gpa_1 + 1*gpa_2 + 1*gpa_3 + 1*gpa_4 + 1*gpa_5
occasion =~ 0*gpa_0 + 1*gpa_1 + 2*gpa_2 + 3*gpa_3 + 4*gpa_4 + 5*gpa_5
# time-varying covariates
gpa_0 ~ lib_hours_0
gpa_1 ~ lib_hours_1
gpa_2 ~ lib_hours_2
gpa_3 ~ lib_hours_3
gpa_4 ~ lib_hours_4
gpa_5 ~ lib_hours_5
gpa_0 ~~ resid*gpa_0 # same residual variance for each time point
gpa_1 ~~ resid*gpa_1
gpa_2 ~~ resid*gpa_2
gpa_3 ~~ resid*gpa_3
gpa_4 ~~ resid*gpa_4
gpa_5 ~~ resid*gpa_5
'
lgc_tvc <- growth(lgc_tvc_model, data=gpa_wide)
summary(lgc_tvc)
lavaan 0.6-3 ended normally after 91 iterations
Optimization method NLMINB
Number of free parameters 17
Number of equality constraints 5
Number of observations 200
Estimator ML
Model Fit Test Statistic 341.888
Degrees of freedom 51
P-value (Chi-square) 0.000
Parameter Estimates:
Information Expected
Information saturated (h1) model Structured
Standard Errors Standard
Latent Variables:
Estimate Std.Err z-value P(>|z|)
intercept =~
gpa_0 1.000
gpa_1 1.000
gpa_2 1.000
gpa_3 1.000
gpa_4 1.000
gpa_5 1.000
occasion =~
gpa_0 0.000
gpa_1 1.000
gpa_2 2.000
gpa_3 3.000
gpa_4 4.000
gpa_5 5.000
Regressions:
Estimate Std.Err z-value P(>|z|)
gpa_0 ~
lib_hours_0 0.045 0.004 10.701 0.000
gpa_1 ~
lib_hours_1 0.039 0.003 13.514 0.000
gpa_2 ~
lib_hours_2 0.033 0.002 13.752 0.000
gpa_3 ~
lib_hours_3 0.028 0.003 11.271 0.000
gpa_4 ~
lib_hours_4 0.024 0.003 8.527 0.000
gpa_5 ~
lib_hours_5 0.022 0.003 6.348 0.000
Covariances:
Estimate Std.Err z-value P(>|z|)
intercept ~~
occasion -0.001 0.001 -0.656 0.512
Intercepts:
Estimate Std.Err z-value P(>|z|)
.gpa_0 0.000
.gpa_1 0.000
.gpa_2 0.000
.gpa_3 0.000
.gpa_4 0.000
.gpa_5 0.000
intercept 2.300 0.030 76.682 0.000
occasion 0.122 0.011 11.123 0.000
Variances:
Estimate Std.Err z-value P(>|z|)
.gpa_0 (resd) 0.036 0.002 20.000 0.000
.gpa_1 (resd) 0.036 0.002 20.000 0.000
.gpa_2 (resd) 0.036 0.002 20.000 0.000
.gpa_3 (resd) 0.036 0.002 20.000 0.000
.gpa_4 (resd) 0.036 0.002 20.000 0.000
.gpa_5 (resd) 0.036 0.002 20.000 0.000
intrcpt 0.030 0.005 6.019 0.000
occasin 0.003 0.001 5.975 0.000
Note the problem here is similar to that with the residual variances. Unless we fix the coefficient to be constant, this is akin to having an interaction of the time-varying covariate with a categorical form of time. So in the same model we flip from considering time as a numeric and linear effect on the outcome, to one that is categorical. This is rarely done in standard regression models, though for some reason standard for the LGC setting. The following will get us back to the comparable mixed model.
lgc_tvc <- growth(lgc_tvc_model, data=gpa_wide)
summary(lgc_tvc)
lavaan 0.6-3 ended normally after 54 iterations
Optimization method NLMINB
Number of free parameters 17
Number of equality constraints 10
Number of observations 200
Estimator ML
Model Fit Test Statistic 357.578
Degrees of freedom 56
P-value (Chi-square) 0.000
Parameter Estimates:
Information Expected
Information saturated (h1) model Structured
Standard Errors Standard
Latent Variables:
Estimate Std.Err z-value P(>|z|)
intercept =~
gpa_0 1.000
gpa_1 1.000
gpa_2 1.000
gpa_3 1.000
gpa_4 1.000
gpa_5 1.000
occasion =~
gpa_0 0.000
gpa_1 1.000
gpa_2 2.000
gpa_3 3.000
gpa_4 4.000
gpa_5 5.000
Regressions:
Estimate Std.Err z-value P(>|z|)
gpa_0 ~
lb_hr_0 (lh_c) 0.032 0.002 15.951 0.000
gpa_1 ~
lb_hr_1 (lh_c) 0.032 0.002 15.951 0.000
gpa_2 ~
lb_hr_2 (lh_c) 0.032 0.002 15.951 0.000
gpa_3 ~
lb_hr_3 (lh_c) 0.032 0.002 15.951 0.000
gpa_4 ~
lb_hr_4 (lh_c) 0.032 0.002 15.951 0.000
gpa_5 ~
lb_hr_5 (lh_c) 0.032 0.002 15.951 0.000
Covariances:
Estimate Std.Err z-value P(>|z|)
intercept ~~
occasion -0.001 0.001 -0.973 0.331
Intercepts:
Estimate Std.Err z-value P(>|z|)
.gpa_0 0.000
.gpa_1 0.000
.gpa_2 0.000
.gpa_3 0.000
.gpa_4 0.000
.gpa_5 0.000
intercept 2.385 0.021 113.389 0.000
occasion 0.083 0.005 15.983 0.000
Variances:
Estimate Std.Err z-value P(>|z|)
.gpa_0 (resd) 0.038 0.002 20.000 0.000
.gpa_1 (resd) 0.038 0.002 20.000 0.000
.gpa_2 (resd) 0.038 0.002 20.000 0.000
.gpa_3 (resd) 0.038 0.002 20.000 0.000
.gpa_4 (resd) 0.038 0.002 20.000 0.000
.gpa_5 (resd) 0.038 0.002 20.000 0.000
intrcpt 0.033 0.005 6.148 0.000
occasin 0.003 0.001 5.523 0.000
summary(gpa_mixed_tvc)
Linear mixed model fit by REML ['lmerMod']
Formula: gpa ~ occasion + lib_hours + (1 + occasion | student)
Data: gpa
REML criterion at convergence: 48.5
Scaled residuals:
Min 1Q Median 3Q Max
-3.4105 -0.5185 -0.0023 0.5202 2.9575
Random effects:
Groups Name Variance Std.Dev. Corr
student (Intercept) 0.033135 0.18203
occasion 0.002817 0.05307 -0.13
Residual 0.037591 0.19388
Number of obs: 1200, groups: student, 200
Fixed effects:
Estimate Std. Error t value
(Intercept) 2.385494 0.021079 113.17
occasion 0.082838 0.005196 15.94
lib_hours 0.032216 0.002024 15.92
Correlation of Fixed Effects:
(Intr) occasn
occasion -0.122
lib_hours -0.637 -0.284
Now imagine having just a few of those kinds of variables as would be common in most longitudinal settings. In the mixed model framework one would add them in as any covariate in a regression model, and each covariate would be associated with a single fixed effect. In the LGC framework, one has to regress each time point for the target variable on its corresponding predictor time point. It might take a few paragraphs to explain the coefficients for just a handful of covariates. If you fix them to a single value, you would duplicate the mixed model, but the syntax requires even more tedium.
One difference seen in comparing LGC models vs. mixed models is that in the former, random slopes are always assumed, whereas in the latter, one would typically see if it’s worth adding random slopes in the first place, or simply not assume them.
The SEM framework is inherently multivariate, and your data will need to be in wide format. This isn’t too big of a deal until you have many time-varying covariates, then the model syntax is tedious and you end up having the number of parameters to estimate climb rapidly.
As we have noted before, SEM is inherently a large sample technique. The growth curve model does not require as much for standard approaches, but may require a lot more depending on the model one tries to estimate. In my own simulations, I haven’t seen too much difference compared to mixed models even for notably small sample sizes, but those were for very simple models.
A basic growth curve model requires four time points to incorporate the flexibility that would make it worthwhile. Mixed models don’t have the restriction (outside of the obvious need of two).
Mixed models can run even if some clusters have a single value. SEM requires balanced data and so one will always have to estimate missing values or drop them. Whether this missingness can be ignored in the standard mixed model framework is a matter of some debate in certain circles.
Numbering your time from zero makes sense in both worlds. This leads to the natural interpretation that the intercept is the mean for your first time point. In other cases having a centered value would make sense, or numbering from 0 to a final value of 1, which would mean the slope coefficient represents the change over the whole time span.
Between lme4 and nlme, you can do any standard LGC. Besides that, various packages provide functionality that some might think is only done with SEM software. One package I highly recommend is brms, as it builds on many other packages that incorporate a mixed model approach in the Bayesian framework. The others are ones that come to mind off the top of my head, and in some cases should be seen as a starting point only.
Note also, unless you are incorporating latent variables, e.g. from scale measurements, there is little need to use something like lavaan or Mplus for standard mixed/multilevel modeling (i.e. in the long data framework), though they have such functionality. The Mplus manual also lumps survival and standard time series in the chapter on longitudinal and related models. I personally can’t see a scenario where I would use Mplus for survival or time series given the multitude of easier to use, more flexible and more powerful options in R.
In short, you’d need a very complicated growth model to require moving from the mixed model framework to SEM-specific software, one that combines potentially already complex modeling situations.
Growth curve modeling is an alternative way to do what is very commonly accomplished through mixed models, and allow for more complex models than typically seen for standard mixed models. One’s default should probably be to use the more common, and probably more flexible (in most situations), mixed modeling tools, where there are packages in R that could handle nonlinear effects, mediation and multivariate outcomes for mixed models. I have other documents regarding mixed models on my website and code at GitHub, including a document that does more comparison to growth curve models. However, the latent variable approach may provide what you need, and at the very least gives you a fresh take on the standard mixed model perspective.
Usually we would draw both random effects from a multivariate normal distribution with some covariance.↩